{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CPSC 330 hw2\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "plt.rcParams['font.size'] = 16\n",
    "\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.model_selection import train_test_split, cross_val_score, cross_validate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instructions\n",
    "rubric={points:3}\n",
    "\n",
    "Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md). In particular, **see the note about not pushing downloaded data to your repo**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introducing the data set\n",
    " \n",
    "For this  assignment you'll be looking at Kaggle's [Spotify Song Attributes](https://www.kaggle.com/geomack/spotifyclassification/) dataset.\n",
    "The dataset contains a number of features of songs from 2017 and a binary variable `target` that represents whether the user liked the song (encoded as 1) or not (encoded as 0). See the documentation of all the features [here](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/). \n",
    "\n",
    "This dataset is publicly available on Kaggle, and you will have to download it yourself. Follow the steps below to get the data CSV. \n",
    "\n",
    "1. If you do not have an account with [Kaggle](https://www.kaggle.com/), you will first need to create one (it's free).\n",
    "2. Login to your account and [download](https://www.kaggle.com/geomack/spotifyclassification/download) the dataset.\n",
    "3. Unzip the data file if needed, then rename it to `spotify.csv`, and move it to the same directory as this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1: Exploratory data analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-d4d478b6cdc9bf88",
     "locked": true,
     "schema_version": 3,
     "solution": false
    }
   },
   "source": [
    "#### 1(a) \n",
    "rubric={points:1}\n",
    "\n",
    "Read in the data CSV and store it as a pandas dataframe named `spotify_df`. The first column of the .csv file should be set as the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-4f3f14b59fd7e6b8",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1(b)\n",
    "rubric={points:2}\n",
    "\n",
    "Run the following line of code to split the data. How many training and test examples do we have?\n",
    "\n",
    "Note: we are setting the `random_state` so that everyone has the same split on their assignments. This will make it easier for the TAs to grade."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train, df_test = train_test_split(spotify_df, test_size=0.2, random_state=321)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1(c)\n",
    "rubric={points:2}\n",
    "\n",
    "- Print out the output of `describe()` **on the training split**. This will compute some summary statistics of the numeric columns.\n",
    "- Which feature has the smallest range (max-min)? \n",
    "\n",
    "Note that `describe` returns another DataFrame."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-b33320bcf667584a",
     "locked": true,
     "schema_version": 3,
     "solution": false
    }
   },
   "source": [
    "#### 1(d) \n",
    "rubric={points:5}\n",
    "\n",
    "Let's focus on the following features:\n",
    "\n",
    "- danceability\n",
    "- tempo\n",
    "- energy\n",
    "- valence\n",
    "\n",
    "For each of these features (in order), produce a histogram that shows the distribution of the feature values in the training set, **separated for positive and negative examples**. \n",
    "By \"positive examples\" we mean target = 1 (user liked the song, positive sentiment) and by \"negative examples\" we mean target = 0 (used disliked the song, negative sentiment). As an example, here is what the histogram would look like for a different feature, loudness:\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](loudness.png)\n",
    "\n",
    "(You don't have to match all the details exactly, such as colour, but your histograms should look something like this, with a reasonable number of bins to see the shape of the distribution.) As shown above, there are two different histograms, one for target = 0 and one for target = 1, and they are overlaid on top of each other. The histogram above shows that extremely quiet songs tend to be disliked (more blue bars than orange on the left) and very loud songs also tend to be disliked (more blue than orange on the far right).\n",
    "\n",
    "To adhere to the [DRY (Don't Repeat Yourself)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) principle, make sure you use a `for` loop for your plotting, rather than repeating the plotting code 4 times. For this to work, I used `plt.show()` at the end of your loop, which draws the figure and resets the canvas for your next plot."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is some code that separates out the dataset into positive and negative examples, to help you get started:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "negative_examples = df_train.query(\"target == 0\")\n",
    "positive_examples = df_train.query(\"target == 1\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1(e)\n",
    "rubric={points:3}\n",
    "\n",
    "Let's say you had to make a decision stump (decision tree with depth 1), _by hand_, to predict the target class. Just from looking at the plots above, describe a reasonable split (feature name and threshold) and what class you would predict in the two cases. For example, in the loudness histogram provided earlier on, it seems that very large values of loudness are generally disliked (more blue on the right side of the histogram), so you might answer something like this: \"A reasonable split would be to predict 0 if loudness > -5 (and predict 1 otherwise).\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1(f)\n",
    "rubric={points:2}\n",
    "\n",
    "Let's say that, for a particular feature, the histograms of that feature are identical for the two target classes. Does that mean the feature is not useful for predicting the target class?\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-86f9e0c649669daf",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "#### 1(g) \n",
    "rubric={points:2}\n",
    "\n",
    "Note that the dataset includes two free text features labeled `song_title` and `artist`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train[[\"song_title\", \"artist\"]].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Do you think these features could be useful in predicting whether the user liked the song or not? \n",
    "- Would there be any difficulty in using them in your model?   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-dce517defdc16360",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-1440876fbc49ead5",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## Exercise 2: Using sklearn to build a decision tree classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-706403e72adade4b",
     "locked": true,
     "schema_version": 3,
     "solution": false
    }
   },
   "source": [
    "#### 2(a) \n",
    "rubric={points:3}\n",
    "\n",
    "- Create `X_train` and `y_train` and `X_test` and `y_test` from the Spotify dataset. Skip the `song_title` and `artist` features for now. \n",
    "- Fit a `DecisionTreeClassifier` on the train set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-859d4a70667da85d",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-43ac6f91bc3bd9da",
     "locked": true,
     "schema_version": 3,
     "solution": false
    }
   },
   "source": [
    "#### 2(b)\n",
    "rubric={points:2}\n",
    "\n",
    "Use the `predict` method to predict the class of the first example in your `X_train`. Is the prediction correct? That is, does it match with the corresponding classes in `y_train`?  \n",
    "\n",
    "Note: you can grab the first example with `X_train.iloc[[0]]`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(c) \n",
    "rubric={points:2}\n",
    "\n",
    "Use the `cross_val_score` function on your training set to compute the 10-fold cross-validation accuracy of your tree. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(d)\n",
    "rubric={points:1}\n",
    "\n",
    "The above is useful, but we would like to see the training accuracy as well. \n",
    "\n",
    "- Compute the 10-fold cross-validation again but this time using the `cross_validate` function with `return_train_score=True`. \n",
    "- Print out both the cross-validation score and the training score.\n",
    "- Is your cross-validation score exactly the same as what you got in the previous part? Very briefly discuss."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(e)\n",
    "rubric={points:1}\n",
    "\n",
    "Do you see a significant difference between the training score and the cross-validation score? Briefly discuss."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-a89757274fc5586f",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(f)\n",
    "rubric={points:1}\n",
    "\n",
    "Inspect the 10 sub-scores from the 10 folds of cross-validation. How does this inform the trustworthiness of your cross validation score?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-a89757274fc5586f",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-4150979c1845a18c",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## Exercise 3: Hyperparameters \n",
    "rubric={points:10}\n",
    "\n",
    "In this exercise, you'll experiment with the `max_depth` hyperparameter of the decision tree classifier. See the [`DecisionTreeClassifier` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for more details.\n",
    "\n",
    "- Explore the `max_depth` hyperparameter. Run 10-fold cross-validation for trees with different values of `max_depth` (at least 10 different values in the range 1 to 25).\n",
    "- For each `max_depth`, get both the train accuracy and the cross-validation accuracy.\n",
    "- Make a plot with `max_depth` on the *x*-axis and the train and cross-validation scores on the *y*-axis. That is, your plot should have two curves, one for train and one for cross-validation. Include a legend to specify which is which.\n",
    "- Discuss how changing the `max_depth` hyperparameter affects the training and cross-validation accuracy. From these results, what depth would you pick as the optimal depth? \n",
    "- Do you think that the depth you chose would generalize to other \"spotify\" datasets (i.e., data on other spotify users)?\n",
    "\n",
    "Note: generally speaking (for all assignments) you are welcome to copy/paste code directly from the lecture notes, though I ask that you add a small citation (e.g. \"Adapted from lecture 2\") if you do so."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 4: Test set\n",
    "rubric={points:4}\n",
    "\n",
    "Remember the test set you created way back at the beginning of this assignment? Let's use it now to see if our cross-validation score from the previous exercise is trustworthy. \n",
    "\n",
    "- Select your favorite `max_depth` from the previous part.\n",
    "- Train a decision tree classifier using that `max_depth` on the _entire training set_.\n",
    "- Compute and display the test score. \n",
    "- How does it compare to the cross-validation score from the previous exercise? Briefly discuss. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 5: Conceptual questions\n",
    "rubric={points:3}\n",
    "\n",
    "Consider the dataset below, which has $6$ examples and $2$ features:\n",
    "\n",
    "$$ X = \\begin{bmatrix}5 & 2\\\\4 & 3\\\\  2 & 2\\\\ 10 & 10\\\\ 9 & -1\\\\ 9& 9\\end{bmatrix}, \\quad y = \\begin{bmatrix}-1\\\\-1\\\\+1\\\\+1\\\\+1\\\\+1\\end{bmatrix}.$$\n",
    "\n",
    "1. Say we fit a decision stump (depth 1 decision tree) and the first split is on the first feature (left column) being less than 5.5. What would we predict in the \"true\" and \"false\" cases here?\n",
    "2. What training accuracy would the above stump get on this data set?\n",
    "3. Can we obtain 100% accuracy with a single decision stump?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Submission to Canvas\n",
    "\n",
    "When you are ready to submit your assignment do the following:\n",
    "\n",
    "1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`.\n",
    "2. Save your notebook.\n",
    "3. Convert your notebook to `.html` format using the `convert_notebook()` function below **or** by `File -> Export Notebook As... -> Export Notebook to HTML`.\n",
    "4. Run the code `submit()` below to go through an interactive submission process to Canvas.\n",
    ">For this step, you will need a Canvas *Access Token* token. If you haven't already got one, log-in to Canvas, click `Account` (top-left of the screen), then `Settings`, then scroll down until you see the `+ New Access Token` button. Click that button, give your token any name you like and set the expiry date to Dec 31, 2020. Then click `Generate token`. Save this token in a safe place on your computer as you'll need it for all assignments. Treat the token with as much care as you would an important password. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from canvasutils.submit import submit, convert_notebook\n",
    "\n",
    "# Note: the canvasutils package should have been installed as part of your environment setup - \n",
    "# see https://github.com/UBC-CS/cpsc330/blob/master/docs/setup.md"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert_notebook(\"hw2.ipynb\", \"html\")  # uncomment and run when you want to try convert your notebook to HTML (or you can convert manually from the File menu)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# submit(course_code=53561, token=False)  # uncomment and run when ready to submit "
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "celltoolbar": "Create Assignment",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
